



# HSA MEMORY MODEL HOT CHIPS TUTORIAL - AUGUST 2013

BENEDICT R GASTER WWW.QUALCOMM.COM

#### OUTLINE



HSA Memory Model

- OpenCL 2.0
  - Has a memory model too

#### Obstruction-free bounded deques

• An example using the HSA memory model



# HSA MEMORY MODEL

## TYPES OF MODELS



- Shared memory computers and programming languages, divide complexity into models:
  - 1. Memory model specifies safety
    - e.g. can a work-item prevent others from progressing?
    - This is what this section of the tutorial will focus on
  - 2. Execution model specifies liveness
    - Described in Ben Sander's tutorial section on HSAIL
    - e.g. can a work-item prevent others from progressing
  - 3. Performance model specifies the big picture
    - e.g. caches or branch divergence
    - Specific to particular implementations and outside the scope of today's tutorial

#### THE PROBLEM



- Assume all locations (a, b, ...) are initialized to 0
- What are the values of \$s2 and \$s4 after execution?

Work-item 0

Work-item 1

mov\_u32 \$s1, 1 ;
st\_global\_u32 \$s1, [&a] ;
ld\_global\_u32 \$s2, [&b] ;

mov\_u32 \$s3, 1 ;
st\_global\_u32 \$s3, [&b] ;
ld\_global\_u32 \$s4, [&a] ;

\*b = 1; int y = \*a;

initially 
$$*a = 0 \&\& *b = 0$$

## THE SOLUTION



#### The memory model tells us:

- Defines the visibility of writes to memory at any given point
- Provides us with a set of possible executions



- Programmability ; A good model should make it (relatively) easy to write multiwork-item programs. The model should be intuitive to most users, even to those who have not read the details
- Performance ; A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options
- Portability ; A good model would be adopted widely or at least provide backward compatibility or the ability to translate among models

\* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993.

## SEQUENTIAL CONSISTENCY (SC)\*



#### Axiomatic Definition

- A single processor (core) sequential if "the result of an execution is the same as if the operations had been executed in the order specified by the program."
- A multiprocessor sequentially consistent if "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program."
- But HW/Compiler actually implements more relaxed models, e.g. ARMv7



\* L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. *IEEE Transactions on Computers*, C-28(9):690–91, Sept. 1979.

#### SEQUENTIAL CONSISTENCY (SC)



Work-item 0

Work-item 1

mov\_u32 \$s1, 1 ;
st\_global\_u32 \$s1, [&a] ;
ld\_global\_u32 \$s2, [&b] ;

mov\_u32 \$s3, 1 ; st\_global\_u32 \$s3, [&b] ; ld\_global\_u32 \$s4, [&a] ;

```
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s3, [&b] ;
```

\$s2 = 0 && \$s4 = 1



- Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform
- Many modern processors implement many reordering optimizations
  - Store buffers (TSO\*), work-items can see their own stores early
  - Reorder buffers (XC\*), work-items can see other work-items store early

\*TSO – Total Store Order as implemented by Sparc and x86
\*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno

#### **RELAXED CONSISTENCY (XC)**



Work-item 0

Work-item 1

mov\_u32 \$s1, 1 ;
st\_global\_u32 \$s1, [&a] ;
ld\_global\_u32 \$s2, [&b] ;

mov\_u32 \$s3, 1 ; st\_global\_u32 \$s3, [&b] ; ld\_global\_u32 \$s4, [&a] ;

```
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
```

\$s2 = 0 && \$s4 = 0





- Programmability ; XC is really pretty hard for the programmer to reason about what will be visible when
  - many memory model experts have been known to get it wrong!
- Performance ; XC is good for performance, the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements

• Portability ; XC is very portable as it places very little constraints

#### MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT



To have their cake and eat it!

HSA Provides: The ability to enable programmers to reason with (relatively) intuitive model of SC, while still achieving the benefits of XC!



- HSA, following Java, C++11, and OpenCL 2.0 adopts SC for Data Race Free (DRF)
  - plus some new capabilities !
- (Informally) A data race occurs when two (or more) work-items access the same memory location such that:
  - At least one of the accesses is a WRITE
  - There are no intervening synchronization operations

#### • SC for DRF asks:

- Programmers to ensure programs are DRF under SC
- Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions

\*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990



#### HSA's memory model is based on RC<sub>SC</sub>:

- All Id\_acq and st\_rel are SC
  - Means coherence on all Id\_acq and st\_rel to a single address.
  - All Id\_acq and st\_rel are program ordered per work-item (actually: sequenceorder by language constraints)
- Similar model adopted by ARMv8

 HSA extends RC<sub>SC</sub> to SC for HRF<sup>\*</sup>, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example.

\*Sequential Consistency for Heterogeneous-Race-Free Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, Beckman, B. R. Gaster, B. Hechtman, M D. Hill, S. K. Reinhart, and D. Wood. MSPC'13.



Work-item 0

Work-item 1

mov\_u32 \$s1, 1 ;
st\_global\_u32\_rel \$s1, [&a] ;
ld\_global\_u32\_acq \$s2, [&b] ;

mov\_u32 \$s3, 1 ; st\_global\_u32\_rel \$s3, [&b] ; ld\_global\_u32\_acq \$s4, [&a] ;

```
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32_rel $s1, [&a] ;
ld_global_u32_acq $s2, [&b] ;
st_global_u32_rel $s3, [&b] ;
ld_global_u32_acq $s4, [&a] ;
```

\$s2 = 0 && \$s4 = 1

Two memory accesses participate in a data race if they

- access the same location
- at least one access is a store
- can occur simultaneously
  - i.e. appear as adjacent operations in interleaving.
- A program is data-race-free if no possible execution results in a data race.
- Sequential consistency for data-race-free programs
  - Avoid everything else

HSA: Not good enough!



#### ALL ARE NOT EQUAL – OR SOME CAN **SEE BETTER THAN OTHERS**







**Dimension X** 

Dimen

#### DATA-RACE-FREE IS NOT ENOUGH





## SEQUENTIAL CONSISTENCY FOR HETEROGNEOUS-RACE-FREE



#### Two memory accesses participate in a heterogeneous race if

- access the same location
- at least one access is a store
- can occur simultaneously
  - i.e. appear as adjacent operations in interleaving.
- Are not synchronized with "enough" scope
- A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.
- Sequential consistency for heterogeneous-race-free programs
  - Avoid everything else



#### HRF0: Basic Scope Synchronization

- "enough" = both threads synchronize using *identical* scope
- Recall example:
  - Contains a heterogeneous race in HSA



## HOW TO USE HSA WITH SCOPES



Want: For performance, use smallest scope possible What is safe in HSA?



HSA Scope Selection Guideline

Is this a valid assumption?

#### Implication:



Producers/consumers must be known at synchronization time

#### **REGULAR GPGPU WORKLOADS**





Well defined (regular) data partitioning +

Well defined (regular) synchronization pattern =

Producer/consumers are always known

## IRREGULAR WORKLOADS



#### • HSA: example is race

- Must upgrade wg (workgroup) -> plat (platform)
- HSA memory model says:
  - Id \$s1, [&x], will see value (1)!





# OPENCL 2.0 HAS A MEMORY MODEL TOO MAPPING ONTO HSA'S MEMORY MODEL



- Provisional specification released at SIGGRAPH'13, July 2013.
- Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular work loads)
- Key features:
  - Shared virtual memory, including platform atomics
  - Formally defined memory model based on C11 plus support for scopes
    - Includes an extended set of C1X atomic operations
  - Generic address space, that subsumes global, local, and private
  - Device to device enqueue
  - Out-of-order device side queuing model
  - Backwards compatible with OpenCL 1.x



## OPENCL 1.X MEMORY MODEL MAPPING

 It is straightforward to provide a mapping from OpenCL 1.x to the proposed model

| <b>OpenCL Operation</b> | HSA Memory Model<br>Operation               |
|-------------------------|---------------------------------------------|
| Load                    | ld_global_wg<br>ld_group_wg                 |
| Store                   | st_global_wg<br>st_group_wg                 |
| atomic_op               | atomic_op_global_comp<br>atomic_op_group_wg |
| barrier()               | sync_wg                                     |

- OpenCL 1.x atomics are unordered and so map to atomic\_op\_X
- Mapping for fences not shown but straightforward

## OPENCL 2.0 MEMORY MODEL MAPPING



| OpenCL Operation              | HSA Memory Model Operation                      |
|-------------------------------|-------------------------------------------------|
| Load<br>memory_order_relaxed  | atomic_ld_[global   group]_scope                |
| Store<br>Memory_order_relaxed | atomic_st_[global   group]_scope                |
| Load<br>memory_order_acquire  | ld_[global   group]_acq_scope                   |
| Load<br>memory_order_seq_cst  | fence_rel_scope ; ld_[global   group]_acq_scope |
| Store<br>memory_order_release | st_[global   group]_rel_scope                   |
| Store<br>Memory_order_seq_cst | st_[global   group]_rel_scope ; fence_acq_scope |
| memory_order_acq_rel          | atomic_op_[global   group]_acq_rel_scope        |
| memory_order_seq_cst          | atomic_op_[global group]_acq_rel_scope          |



| OpenCL Scope                 | HSA Scope  |
|------------------------------|------------|
| memory_scope_work_group      | _wg        |
| memory_scope_device          | _component |
| memory_scope_all_svm_devices | _platform  |



# OBSTRUCTION-FREE BOUNDED DEQUES

AN EXAMPLE USING THE HSA MEMORY MODEL



- Why do we need such a memory model in practice?
- One important application of memory consistency is in the development and use of concurrent data-structures
- In particular, there are a class data-structures implementations that provide non-blocking guarantees:
  - wait-free; An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes
    - In practice very hard to build efficient data-structures that meet this requirement
  - lock-free; An algorithm is lock-free if every if, given enough time, at least one thread of the work-items (or threads) makes progress
    - In practice lock-free algorithms are implemented by work-item cooperating with one enough to allow progress
  - Obstruction-free; An algorithm is obstruction-free if a work-item, running in isolation, can make progress

# BUT WAY NOT JUST USE MUTUAL EXCLUSION?





Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more can mean traditional mutual exclusion is not the right choice

.

Fabric & Memory Controller

## **CONCURRENT DATA-STRUCTURES**



- Emerging heterogeneous compute clusters means we need:
  - To adapt existing concurrent data-structures
  - Developer new concurrent data-structures
- Lock based programming may still be useful but often these algorithms will need to be lock-free
- Of course, this is a key application of the HSA memory model
- To showcase this we highlight the development of a well known (HLM) obstruction-free deque\*

\*Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.

#### HLM - OBSTRUCTION-FREE DEQUE



- Uses a fixed length circular queue
- At any given time, reading from left to right, the array will contain:
  - Zero or more left-null (LN) values
  - Zero or more dummy-null (DN) values
  - Zero or more right-null (RN) values
- At all times there must be:
  - At least two different nulls values
    - At least one LN or DN, and at least one DN or RN
- Memory consistency is required to allow multiple producers and multiple consumers, potentially happening in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents)



#### HLM - OBSTRUCTION-FREE DEQUE



Key:

| LN | <ul> <li>left null value</li> </ul> |
|----|-------------------------------------|
| RN | - right null value                  |
| V  | – value                             |

- left left hint index
- right right hint index

## C REPRESENTATION OF DEQUE



#### struct node {

- uint64\_t type : 2; // null type (LN, RN, DN)
- uint64\_t counter : 8 ; // version counter to avoid ABA
- uint64\_t value : 54 ; // index value stored in queue

```
}
```

```
struct queue {
```

- unsigned int size; // size of bounded buffer
- node \* array; // backing store for deque itself

#### HSAIL REPRESENTATION



• Allocate a deque in global memory using HSAIL

@deque\_instance:

align 64 global\_u32 &size;

align 8 global\_u64 &array;





#### • Assume a function:

function &rcheck\_oracle (arg\_u32 %k, arg\_u64 %left, arg\_u64 %right) (arg\_u64 %queue);

#### • Which given a deque

- returns (%k) the position of the left most of RN
  - Id\_global\_acq used to read node from array
  - Makes one if necessary (i.e. if there are only LN or DN)
  - atomic\_cas\_global\_ar, required to make new RN
- returns (%left) the left node (i.e. the value to the left of the left most RN position)
- returns (%right) the right node (i.e. the value at position (%k))

## **RIGHT POP**



```
function & right pop(arg u32err, arg u64 % result) (arg u64 % degue) {
 // load queue address
 ld_arg_u64 $d0, [%deque];
@loop forever:
 // setup and call right oracle to get next RN
  arg_u32 %k; arg_u64 %current; arg_u64 %next;
  call &rcheck_oracle(%queue);
 Id arg_u32 $s0, [%k]; Id arg_u64 $d1, [%current]; Id arg_u64 $d2, [%next];
 // current.value($d5)
  shr_u64 $d5, $d1, 62;
 // current.counter($d6)
  and_u64 $d6, $d1,
  0x3FC0000000000000;
  shr u64 $d6, $d6, 54;
// current.value($d7)
  and u64 $d7, $d1, 0x3FFFFFFFFFFF;
// next.counter($d8)
  and u64 $d8, $d2, 0x3FC0000000000000; shr u64 $d8, $d8, 54; brn @loop forever;
}
```

#### **RIGHT POP – TEST FOR EMPTY**



```
// current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
or_b1 $c0, $c0, $c1;
cbr $c0, @not_empty;
// current node index (%deque($d0) + (%k($s1) - 1) * 16)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; add_u32 $d3, $d0, $s0;
ld_global_acq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [&err]; // deque empty so return EMPTY
%ret
@not empty:
```



#### RIGHT POP – TRY READ/REMOVE NODE

// \$d9 = (RN, next.cnt+1, 0)
add\_u64 \$d8, \$d8, 1;
shl\_u64 \$d9, RN, 62;
and\_u64 \$d8, \$d8, \$d9;

// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic\_cas\_global\_ar\_u64 \$d9, [\$s0], \$d2, \$d9;
cmp\_neq\_u64 \$c0, \$d9, \$d2;
cbr \$c0, @cas\_failed;

#### // \$d9 = (RN, current.cnt+1, 0)

add\_u64 \$d6, \$d6, 1; shl\_u64 \$d9, RN, 62; and\_u64 \$d9, \$d6, \$d9;

// cas(deq+(k-1), curr, node(RN, curr.cnt+1,0)
atomic\_cas\_global\_ar\_u64 \$d9, [\$s1], \$d1, \$d9;
cmp\_neq\_u64 \$c0, \$d9, \$d1;
cbr \$c0, @cas\_failed;
st\_arg\_u32 SUCCESS, [&err];
st\_arg\_u64 \$d7, [&value];
%ret
@cas\_failed:
// loop back around and try again

© Copyright 2012 HSA Foundation. All Rights Reserved.

#### TAKE AWAYS

- HSA provides a powerful and modern memory model
  - Based on the well know SC for DRF
  - Defined as Release Consistency
  - Extended with scopes as defined by HRF
- OpenCL 2.0 introduces a new memory model
  - Also based on SC for DRF
  - Also defined in terms of Release Consistency
  - Also Extended with scope as defined in HRF
  - Has a well defined mapping to HSA
- Concurrent algorithm development for emerging heterogeneous computing cluster can benefit from HSA and OpenCL 2.0 memory models

